-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
build(deps): update Arrow/Parquet to 52.0
, object-store to 0.10
#10765
Conversation
Signed-off-by: Ruihang Xia <waynestxia@gmail.com>
Signed-off-by: Ruihang Xia <waynestxia@gmail.com>
Signed-off-by: Ruihang Xia <waynestxia@gmail.com>
Signed-off-by: Ruihang Xia <waynestxia@gmail.com>
Signed-off-by: Ruihang Xia <waynestxia@gmail.com>
Signed-off-by: Ruihang Xia <waynestxia@gmail.com>
Signed-off-by: Ruihang Xia <waynestxia@gmail.com>
Signed-off-by: Ruihang Xia <waynestxia@gmail.com>
Signed-off-by: Ruihang Xia <waynestxia@gmail.com>
Signed-off-by: Ruihang Xia <waynestxia@gmail.com>
Signed-off-by: Ruihang Xia <waynestxia@gmail.com>
Signed-off-by: Ruihang Xia <waynestxia@gmail.com>
Okay, the one last thing is to update Cargo.toml after arrow |
Epic! |
Signed-off-by: Ruihang Xia <waynestxia@gmail.com>
@waynexia It looks like you need to run |
common::collect(merged_aggregate.execute(0, task_ctx.clone())?).await?; | ||
// enlarge memory limit in spill mode | ||
let task_ctx = if spill { | ||
new_spill_ctx(2, 2600) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this be configurable? Also could you add a comment explaining what is happening here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@viirya could you also review?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This change has two stages to my understanding. The first partial aggr stage has a smaller limit, to meet the expectation that the partial plan has early emit.
datafusion/datafusion/physical-plan/src/aggregates/mod.rs
Lines 1504 to 1516 in 6846513
let expected = if spill { | |
vec![ | |
"+---+---------------+-------------+", | |
"| a | AVG(b)[count] | AVG(b)[sum] |", | |
"+---+---------------+-------------+", | |
"| 2 | 1 | 1.0 |", | |
"| 2 | 1 | 1.0 |", | |
"| 3 | 1 | 2.0 |", | |
"| 3 | 2 | 5.0 |", | |
"| 4 | 3 | 11.0 |", | |
"+---+---------------+-------------+", | |
] | |
} else { |
And the new lines enlarge the limit after that "early emit" check. Otherwise the merge aggr would fall because of insufficient memory. (at line 1554)
let result = common::collect(merged_aggregate.execute(0, task_ctx)?).await?;
The root cause of this change is somehow the memory requirement becomes larger. From my investigation, the biggest change compared to the previous is from the RawTable
in GroupValuesPrimitive
-- it needs about 1000 bytes more:
datafusion/datafusion/physical-plan/src/aggregates/group_values/primitive.rs
Lines 81 to 95 in 6846513
pub struct GroupValuesPrimitive<T: ArrowPrimitiveType> { | |
/// The data type of the output array | |
data_type: DataType, | |
/// Stores the group index based on the hash of its value | |
/// | |
/// We don't store the hashes as hashing fixed width primitives | |
/// is fast enough for this not to benefit performance | |
map: RawTable<usize>, | |
/// The group index of the null value if any | |
null_group: Option<usize>, | |
/// The values for each group index | |
values: Vec<T::Native>, | |
/// The random state used to generate hashes | |
random_state: RandomState, | |
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I hadn't realized that these literal values were just in tests ... thanks for the explanation
@@ -2032,7 +2037,7 @@ mod tests { | |||
spill: bool, | |||
) -> Result<()> { | |||
let task_ctx = if spill { | |||
new_spill_ctx(2, 3200) | |||
new_spill_ctx(2, 4200) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Again, seems like we need to make something configurable here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I might be misunderstanding, do you mean the change in 3a33755 🤔 ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM but I would like to understand the aggregate spilling changes more.
Thanks @waynexia
I merged up from main and updated the lock file |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @waynexia -- I agree with @andygrove 's concern about the changes to spilling -- I think we should address them before merging but otherwise this looks really nice to me
🙏
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Signed-off-by: Ruihang Xia <waynestxia@gmail.com>
Signed-off-by: Ruihang Xia <waynestxia@gmail.com>
Thanks for reviewing! |
@waynexia I was inspecting the dependency tree for delta-rs and it seems the dependency didn't get bumped for object-store in
|
Sorry for causing this! Opened #10848 to fix it. |
All good, just wanted to bring it to your attention before 39 formally drops :) |
FYI @andygrove -- maybe this means we should make another RC for 39.0.0 🤔 We could also potentially make a 39.1.0 release too with this dependency upgrade fix #10848 |
…pache#10765) * fix compile on default feature config Signed-off-by: Ruihang Xia <waynestxia@gmail.com> * fix test of common, functions, optimizer and physical-expr Signed-off-by: Ruihang Xia <waynestxia@gmail.com> * fix other tests Signed-off-by: Ruihang Xia <waynestxia@gmail.com> * fix one last test Signed-off-by: Ruihang Xia <waynestxia@gmail.com> * fix clippy warnings Signed-off-by: Ruihang Xia <waynestxia@gmail.com> * fix datafusion-cli Signed-off-by: Ruihang Xia <waynestxia@gmail.com> * switch to git deps Signed-off-by: Ruihang Xia <waynestxia@gmail.com> * regen proto file Signed-off-by: Ruihang Xia <waynestxia@gmail.com> * fix pyo3 feature Signed-off-by: Ruihang Xia <waynestxia@gmail.com> * fix slt Signed-off-by: Ruihang Xia <waynestxia@gmail.com> * fix symmetric hash join cases Signed-off-by: Ruihang Xia <waynestxia@gmail.com> * update integration result Signed-off-by: Ruihang Xia <waynestxia@gmail.com> * fix up spill test Signed-off-by: Ruihang Xia <waynestxia@gmail.com> * shift to the released packages Signed-off-by: Ruihang Xia <waynestxia@gmail.com> * Update cargo.lock * Update datafusion/optimizer/src/analyzer/type_coercion.rs Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org> * update document Signed-off-by: Ruihang Xia <waynestxia@gmail.com> * move memory limit to parameter pos Signed-off-by: Ruihang Xia <waynestxia@gmail.com> --------- Signed-off-by: Ruihang Xia <waynestxia@gmail.com> Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Which issue does this PR close?
Closes #.
Rationale for this change
51.0.0
, tonic to0.11
#961352.0.0
arrow-rs#5688What changes are included in this PR?
Are these changes tested?
Are there any user-facing changes?